Can you count the number of sequences in the data/proteome.faa file?
In [1]:
from Bio import SeqIO
counter = 0
for seq in SeqIO.parse('../data/proteome.faa', 'fasta'):
    counter += 1
counter
Out[1]:
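As a cross-check on the count above, a FASTA file can also be counted without Biopython, because every record starts with a header line beginning with `>`. The sketch below is only an illustration: the helper name `count_fasta_records` and the two-record sample are hypothetical, and the sample is an in-memory string, not the real proteome file.

```python
import io

def count_fasta_records(handle):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in handle if line.startswith('>'))

# Hypothetical two-record sample, used here instead of the real proteome file
sample = io.StringIO(">protein_1\nMKV\nLLA\n>protein_2\nMST\n")
n_records = count_fasta_records(sample)
print(n_records)
```

Passing an open file handle for the proteome file instead of `sample` should give the same number as the `SeqIO.parse` loop, assuming the file is well-formed FASTA.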
Can you plot the distribution of protein sizes in the data/proteome.faa file?
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
In [8]:
sizes = []
for seq in SeqIO.parse('../data/proteome.faa', 'fasta'):
    sizes.append(len(seq))
plt.hist(sizes, bins=100)
plt.xlabel('protein size')
plt.ylabel('count');
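A histogram is easier to read next to a few summary numbers. The sketch below computes the minimum, median and maximum with the standard library; `example_sizes` is a hypothetical list of made-up lengths standing in for the `sizes` list, so the values say nothing about the real proteome.

```python
import statistics

# Hypothetical protein lengths, standing in for the `sizes` list
example_sizes = [120, 85, 310, 450, 97, 1024, 233]

summary = {
    'min': min(example_sizes),
    'median': statistics.median(example_sizes),
    'max': max(example_sizes),
}
print(summary)
```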
Can you count the number of CDS sequences in the data/ecoli.gbk file?
In [9]:
counter = 0
for seq in SeqIO.parse('../data/ecoli.gbk', 'genbank'):
    for feat in seq.features:
        if feat.type == 'CDS':
            counter += 1
counter
Out[9]:
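The same loop can be generalised to tally every feature type at once, which shows how CDS features compare with genes, tRNAs and so on. This is a minimal sketch: `count_feature_types` is a hypothetical helper, and the records are `SimpleNamespace` stand-ins with made-up features, not parsed E. coli data.

```python
from collections import Counter
from types import SimpleNamespace

def count_feature_types(records):
    """Tally feature types over records that expose a .features list."""
    return Counter(feat.type for rec in records for feat in rec.features)

# Hypothetical stand-ins for parsed GenBank records (not real E. coli data)
fake_records = [
    SimpleNamespace(features=[SimpleNamespace(type='source'),
                              SimpleNamespace(type='gene'),
                              SimpleNamespace(type='CDS')]),
    SimpleNamespace(features=[SimpleNamespace(type='gene'),
                              SimpleNamespace(type='CDS'),
                              SimpleNamespace(type='tRNA')]),
]
type_counts = count_feature_types(fake_records)
print(type_counts['CDS'])
```

Because the helper only relies on `.features` and `.type`, the iterator returned by `SeqIO.parse('../data/ecoli.gbk', 'genbank')` should be usable in place of `fake_records`.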
Can you compute the average root-to-tip distance in the data/tree.nwk file?
In [11]:
from Bio import Phylo
tree = Phylo.read('../data/tree.nwk', 'newick')
distances = []
for node in tree.get_terminals():
    distances.append(tree.distance(tree.root, node))
sum(distances)/float(len(distances))
Out[11]:
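To make the quantity concrete: the root-to-tip distance of a leaf is the sum of the branch lengths on the path from the root down to that leaf. The sketch below computes it by hand on a tiny hypothetical tree, written as nested `(branch_length, children)` pairs. Both this representation and the `leaf_depths` helper are assumptions made for illustration, not part of Bio.Phylo.

```python
def leaf_depths(node, depth=0.0):
    """Yield the root-to-tip distance of every leaf below `node`.

    A node is a (branch_length, children) pair; leaves have no children.
    """
    branch_length, children = node
    depth += branch_length
    if not children:
        yield depth
    for child in children:
        yield from leaf_depths(child, depth)

# Hypothetical tree, equivalent to the Newick string (A:1.0,(B:2.0,C:1.5):0.5);
toy_tree = (0.0, [
    (1.0, []),                          # leaf A
    (0.5, [(2.0, []), (1.5, [])]),      # internal node with leaves B and C
])
depths = list(leaf_depths(toy_tree))
average_depth = sum(depths) / len(depths)
print(depths, average_depth)
```

Here the leaves sit at distances 1.0, 2.5 and 2.0 from the root, so the average is 5.5 / 3.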
Can you read the yeast protein interaction network in data/yeast.gml? Can you plot the degree distribution of the proteins contained in the graph?
In [15]:
import networkx as nx
In [16]:
graph = nx.read_gml('../data/yeast.gml')
In [23]:
# dict(...) handles both the dict returned by networkx 1.x and the DegreeView of 2.x+
plt.hist(list(dict(nx.degree(graph)).values()), bins=20)
plt.xlabel('degree')
plt.ylabel('count');
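For reference, the degree of a node is the number of edges touching it, and the degree distribution counts how many nodes have each degree. The sketch below works this out with the standard library on a hypothetical four-node edge list, which stands in for the yeast graph and contains no real interaction data.

```python
from collections import Counter

# Hypothetical edge list, standing in for the yeast interaction graph
edges = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C')]

# Degree of a node = number of edges touching it
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Degree distribution = how many nodes have each degree
degree_distribution = Counter(degree.values())
print(sorted(degree_distribution.items()))
```

The histogram drawn with `plt.hist` shows the same information, with degrees grouped into bins.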